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ABSTRACT 

Understanding the transcriptional regulation of 
microRNAs (miRNAs) is extremely important for 
determining the specific roles they play in signaling 
cascades. However, precise identification of tran- 
scription factor binding sites (TFBSs) orchestrating 
the expressions of miRNAs remains a challenge. 
By combining accessible chromatin sequences of 
12 cell types released by the ENCODE Project, 
we found that a significant fraction (~80%) of such 
integrated sequences, evolutionary conserved and 
in regions upstream of human miRNA genes that 
are independently transcribed, were preserved 
across cell types. Accordingly, we developed a 
computational method, Accessible and Conserved 
TFBSs Locater (ACTLocater), incorporating this 
chromatin feature and evolutionary conservation 
to identify the TFBSs associated with human 
miRNA genes. ACTLocater achieved high positive 
predictive values, as revealed by the experimental 
validation of FOXA1 predictions and by the compari- 
son of its predictions of some other transcription 
factors (TFs) to empirical ChlP-seq data. Most 
notably, ACTLocater was widely applicable as 
indicated by the successful prediction of 
TF->> miRNA interactions in cell types whose chro- 
matin accessibility profiles were not incorporated. 
By applying ACTLocater to TFs with characterized 
binding specificities, we compiled a novel reposi- 
tory of putative TF^- miRNA interactions and dis- 
played it in ACTViewer, providing a promising 
foundation for future investigations to elucidate 
the regulatory mechanisms of miRNA transcription 
in humans. 



INTRODUCTION 

Transcriptional regulation is mediated by c ^-elements and 
the transcription factors (TFs) that bind to these elements 
(1). Precise identification of clv-elements or transcription 
factor binding sites (TFBSs) is fundamentally important 
to decipher the complex transcription regulatory 
networks. Chromatin immunoprecipitation (ChIP) 
followed by high-throughput sequencing (ChlP-seq) has 
become the most powerful tool to profile TFBSs (2,3). 
C«-elements are, however, often active only in specific 
cell types or during certain development stages; therefore, 
a comprehensive catalog of all m-elements would require 
a thorough investigation of various physiological condi- 
tions. Current applications of ChlP-seq are limited by the 
availability of ChlP-grade antibodies, and the reported 
binding sites cannot be determined at nucleotide-level 
resolution. Computational methods have long been 
sought as an alternative to time-consuming experimental 
studies; however, owing to the short, degenerate nature of 
TFBSs, predictions usually contained an overwhelming 
number of false positives (FPs), which has been termed 
the 'futility theorem' (4). To improve the accuracy of pre- 
dictions, existing methods often focus on promoter 
regions of protein-coding genes and have employed 
other criteria, such as conservation, co-operation and 
co-regulation (4,5). Recently, owing to the rapid progress 
in measures of chromatin accessibility by DNase-seq (6,7) 
and FAIRE-seq (Formaldehyde Assisted Isolation of 
Regulatory Elements) (8,9), as well as the elucidation of 
histone modification profiles based on ChlP-seq (10,11), 
chromatin accessibility and histone modification profiles 
of human cell lines have been cataloged (12,13). The ac- 
cumulation of these data has provided an unprecedented 
opportunity to improve TFBS predictions in humans 
and methods incorporating such information have been 
successfully developed to work in a cell-specific manner 
(14-16). However, the human body contains more than 
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200 unique cell types, which could be under various 
physiological and pathological conditions. It is unpracti- 
cal to thoroughly profile the chromatin accessibility and 
histone modification of all types of human cells under all 
conditions, and thus the applicability of such methods is 
limited to cell types whose experimental data are available. 
Currently, the major computational problem to solve is to 
enhance the applicability of predictions. 

While the methods for predicting TFBSs associated 
with protein-coding genes are fairly comprehensive, 
unfortunately, transcriptional regulation genomics of 
non-coding RNAs (ncRNAs), such as microRNAs 
(miRNAs), which have been found to collaborate with 
proteins in essential biological processes, have been 
much less investigated. Accurate maps of TFBSs associa- 
ted with ncRNA genes will be essential for a comprehen- 
sive understanding of protein-coding and ncRNA genes 
coordinate regulation network. miRNAs are a class of 
small ncRNAs, which are expressed in a spatio-temporal 
manner and play key roles in diverse biological processes 
through targeting and suppressing the expression of 
protein genes (17-19). Extensive studies have been per- 
formed on miRNA gene identification, expression 
analysis and function. For example, more than 1000 
human miRNA genes have been documented in 
miRBase (20); expression patterns have been profiled 
from about 200 major human organs and cell types (21) 
and more than 3000 confirmed target genes of human, 
mouse or rat miRNAs are included in miRWalk (22). 
Although previous studies have shown that miRNAs 
were under tight transcriptional regulation (23-27), only 
few investigations have been conducted on the transcrip- 
tional regulation of human miRNA genes. Currently, only 
221 human TF^miRNA (miRNA regulation by TF) 
interactions have been reported in TransmiR (28). This 
certainly represents only a small fraction of the total 
number of regulatory networks, and large-scale investiga- 
tions are needed to decode the full set of pathways gov- 
erning the expression of human miRNAs. Human 
miRNAs are either located in the introns of protein- 
coding genes (intronic miRNAs) or between protein- 
coding genes (intergenic miRNAs). Intronic miRNAs are 
likely to be transcribed along with their host genes (29,30), 
and existing TFBS prediction techniques for 
protein-coding genes can be used for these systems. 
However, for intergenic miRNAs, which are independ- 
ently transcribed, the distances between transcription ini- 
tiation sites (TSSs) and miRNA-coding regions 
dramatically vary, ranging from a few hundred bases 
(31) to 30-kb upstream (32). Also, TFs might not almost 
exclusively bind at proximal promoters (12); it is likely 
that a sufficient number of distal TFBSs would locate in 
regions between miRNA promoters and their coding 
regions providing additional regulatory information. To 
fully explore the transcriptional regulation of human 
intergenic miRNAs, it is necessary to examine large 
genomic regions to locate all the possible TFBSs, and 
thus existing methods employing only simple pattern 
matching (33,34) would lack a reasonable accuracy. 
Even filtered out with significant conservation (such as 
Conserved TFBS track in UCSC human genome 



browser and its web interface, PuTmiR) (35,36), there 
would be still a large amount of FPs due to the presence 
of slowly evolving neutral regions (37). On the other hand, 
recent methods based on cell-specific experimental 
data (38) have improved in accuracy, but the scope of 
their applicability is limited. Prediction methods that are 
suitable for human intergenic miRNAs, particularly those 
with high accuracy and a wide applicable range, are still 
lacking. 

Here, we showed that the conserved and accessible 
chromatin sequences integrated from 12 cell types, 
immediately upstream of human intergenic miRNAs 
found also in the mouse and rat (referred as HMR 
intergenic miRNAs), were highly preserved across cell 
types. Therefore, we developed ACTLocater (Accessible 
and Conserved TFBSs Locater) incorporating known 
chromatin features to identify TFBSs associated with 
HMR intergenic miRNAs. Applying ACTLocater to 
selected TFs showed that the positive predictive values 
(PPVs) of predictions were greatly improved compared 
to conventional methods based on sequence conservation 
alone. Although ACTLocater was based on information 
from a limited number of cell types, it successfully pre- 
dicted TF^miRNA interactions in cells whose chromatin 
accessibility information was not incorporated, suggesting 
that ACTLocater could be applied to a wide range of cell 
types. By using ACTLocater for TFs with known binding 
specificities, we established a comprehensive human 
TF^- miRNA interaction database, ACT Viewer. The re- 
sultant maps provided a solid foundation for understand- 
ing the regulatory pathways underlying HMR intergenic 
miRNA expression. 

MATERIALS AND METHODS 

HMR intergenic miRNA identification and TSS prediction 

All human miRNAs and genome annotations were 
obtained from the UCSC Genome Browser (35) hgl8 
assembly. We considered miRNAs that reside between 
the protein-coding genes as intergenic miRNAs and 
identified human intergenic miRNAs conserved in mice 
and rats by the miRviewer (39). We next adopted the 
TSS predictions of these miRNAs from previous studies 
(26,40,41) and manually calibrated them according to the 
full-length cDNA data, SwitchGear TSS predictions and 
ENCODE promoter-associated histone mark (H3K4me3) 
on nine cell lines track in the UCSC Genome Browser. 
If any full-length cDNAs overlapped with the miRNAs, 
the 5'-terminal region of the full-length cDN A was used as 
the TSS for the corresponding miRNA; or if SwitchGear 
TSS prediction in the miRNA upstream region coming 
along with H3K4me3 peaks present in multiple cell lines, 
the SwitchGear prediction was used; otherwise, TSSs for 
the corresponding miRNAs were directly adopted from 
previous studies. Same-strand miRNAs with common pre- 
dicted TSSs were considered as one transcriptional unit. 

HMR accessible alignments extraction and shuffling 

We extracted the genomic coordinates of the upstream 
40-kb flanking regions or truncated flanking regions 
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of genes upstream of HMR intergenic miRNAs from 
human genome. For clustered miRNAs, the regions 
upstream of each individual sequence were merged. We 
collected a total of 14 DNase-seq and four FAIRE-seq 
datasets (as listed in Supplementary Table S2) from 
the Data Coordination Center (http://genome.ucsc.edu/ 
ENCODE/) of the ENCODE Project (12,13) and 
merged them as chromatin accessibility reference. We 
then extracted Human/Mouse/Rat alignments within the 
chromatin accessibility reference and in the regions 
upstream of HMR intergenic miRNA units from 17-way 
genome-wide multiple alignments using Galaxy (42). 

To calculate the signal-to-noise ratios of FOXA1 pre- 
dictions by ACTLocater and the Conservation method, 
we shuffled (99 runs) HMR accessible alignments and 
Human/Mouse/Rat alignments to generate alignments 
with the same length, base composition and patterns of 
gaps and conservation (43), respectively. The signal was 
defined as the number of FOXA1 -binding sites predicted 
with the true alignments, whereas the noise was defined as 
the number of sites predicted with the shuffled alignments. 

TFBSs prediction by ACTLocater 

To locate the TFBSs from the HMR accessibility align- 
ments, we implemented two searching methods, CScanner 
and MScanner, based on the consensus model and the 
PSSM model (44). 

We designed CScanner to search both strands of 
every human sequence for sequence windows, and its cor- 
responding orthologous sequences in mouse and rat 
matched the consensus. We implemented MScanner to 
score sequences on both strands as potential TFBSs 
using the formula (44): 

species, score(i) = A ,■ log 2 — 

where i is the position within the site, pf, is the relative 
frequency of base b in the genome and fbj is the 
observed relative frequency of base b at that position 
(from the matrix). 

For each species, scores were normalized to a 100-point 
scale. The average score was defined as the arithmetic 
mean of the species scores. Predictions were based on 
the thresholds of the species scores and the corresponding 
average score. 

PSSMs and/or consensus sequences used for selected 
TFs predictions by ACTLocater were as follows: 
for FOXA1, M00131 from TRANSFAC (45), MA0148 
from JASPAR (46) and consensus sequence 
TRTTKRYTY (47); for c-Myc, M00118, M00123 and 
M00615 from TRANSFAC, MA0059 and MA0147 
from JASPAR, and consensus sequence CACGTG (48); 
for the E2F family, M00024, M00050 and M00516 from 
TRANSFAC, MA0024 from JASPAR and consensus 
sequence TTTSCGC (49); for MyoD, a PSSM derived 
from ChlP-seq data (50) (details in 'Supplementary 
Methods' section); for NRSF, M00256 from 
TRANSFAC, MAO 138 from JASPAR and two PSSMs 
from previous studies (51,52); for Oct4, M00135, 



M00138, M00161, M00195 and M00210 from 
TRANSFAC, MA0142 from JASPAR and consensus 
sequence ATGCWAAT (53). The cutoff values for 
species scores and average score for MScanner were set 
to 80 and 85, respectively, for all the PSSMs except 
NRSF. Because this protein recognized a long motif, its 
species and average score cutoffs were set to 75 and 80, 
respectively. 

Validation of predictions by genome-wide ChIP data 

We collected the binding peaks from the ENCODE 
Project (12,13) and previous genome-wide ChIP experi- 
ments (54-56) to evaluate the TFBS predictions. ChIP 
peaks of NRSF have been identified by MICSA (57) 
using the raw data from the ENCODE Project. For each 
TF, we merged all ChIP peaks. We next assessed all pre- 
dictions for FOXA1, c-Myc, the E2F family and NRSF by 
comparing the TFBSs predicted in the human genome 
with the corresponding genome-wide ChIP datasets 
generated by using human cell lines. A prediction was 
considered as a true positive (TP) if the predicted site 
was inside a merged ChIP peak; otherwise, the prediction 
was considered as a FP. We calculated the PPV as 
TP/(TP + FP). A merged ChIP peak was successfully 
predicted (TP) if there was at least one predicted TFBS 
within it; otherwise, the peak was annotated as a false 
negative (FN). And we calculated the sensitivity as TP/ 
(TP + FN). 

We assessed the evidence supported MyoD-binding 
sites by comparing the TFBSs predicted in the mouse 
genome with ChlP-seq datasets obtained from mouse 
cells (50) (ChlP-seq peaks were identified as described in 
'Supplementary Methods' section). And we assessed 
evidence supported Oct4-binding sites by comparing the 
TFBSs predicted in the human and mouse genomes with 
ChlP-seq datasets obtained from human and mouse em- 
bryonic stem cells (26), respectively. 

Definition of accessible chromatin regions for skeletal 
muscle and embryonic stem cells 

We defined accessible chromatin regions of skeletal muscle 
cells by merging the DNase-seq peak regions of HSMM, 
HSMMtube and SKMC skeletal muscle cells and defined 
those of embryonic stem cells by merging the DNase-seq 
or FAIRE-seq peak regions of Hl-hESC, H7-hESC and 
H9ES embryonic stem cells. All the DNase-seq and 
FAIRE-seq datasets were obtained from the Data 
Coordination Center of the ENCODE Project, as listed 
in Supplementary Table S2. 

ACTViewer construction 

To construct the ACTViewer database, we collected 
PSSMs from TRANSFAC, JASPAR and UniPROBE. 
Then, we applied MScanner to the HMR accessible align- 
ments for each PSSM. Thresholds for species and average 
scores were set to 80 and 85, respectively. ACTViewer 
database was developed as described previously (58). 
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Cell culture and stable cell line generation 

MDA-MB231 cells were a gift from Prof. Yan Zhang 
(Sun Yat-sen University, Guangzhou). MCF-7 cells and 
HEK293T cells were obtained from the Shanghai Cellular 
Institute of Chinese Academy of Sciences (Shanghai, 
China). All cells were grown in Dulbecco's modified 
Eagle's medium (G1BCO) supplemented with 10% fetal 
bovine serum (GIBCO) and were cultured in a 37°C incu- 
bator with 5% C0 2 . 

The FOXA1 cDNA lentiviral vector and control empty 
vector were purchased from Fulengen Corporation 
(Guangzhou, China). To produce the lentivirus, we tran- 
siently co-transfected FOXA1 cDNA lentiviral vector 
or empty vector with the ViraPower™ Lentiviral 
Packaging Mix (Invitrogen) into HEK293T cells using 
Lipofectamine 2000 reagent (Invitrogen). MDA-MB231 
cells were infected and then selected with puromycin 
(Sigma-Aldrich). 

Chromatin immunoprecipitation assay 

We performed ChIP assays in MCF-7 cells by using a 
chromatin immunoprecipitation kit (Millipore) according 
to the manufacturer's instructions. Protein-DNA 
complexes were precipitated with a control IgG or an 
anti-FOXAl antibody (ab5089, Abeam). PCR primers 
are listed in Supplementary Table S5. 

Western blotting analysis 

Proteins were extracted with RIPA lysis buffer. FOXA1 
protein was revealed with a polyclonal antibody (C-20, 
Santa Cruz). Signals were detected by the Super Signal 
Western Pico chemiluminescent substrate (Pierce). 
Western blotting of GAPDH on the same membrane 
was used as a loading control. 

miRNA microarray 

miRNA microarrays were performed at the Beijing 
CapitalBio Corporation. Total RNA content extracted 
by TRIzol reagent (Invitrogen) from MDA-MB231 cells 
stably overexpressing FOXA1 or control cells was labeled 
with biotin. Labeled samples were hybridized to 
GeneChip®miRNA Array (V2.0). Raw data were 
normalized and analyzed using the miRNA QC tool 
software (Affymetrix). We considered the HMR intergenic 
miRNAs with a fold change > 1.5 or <0.75 in the FOXA1 
overexpressing MDA-MB231 cells as FOXA1 -affected 
HMR intergenic miRNAs. 

Real-time RT-PCR 

Total RNA content was reverse-transcribed to cDNA 
using the Primescript RT reagent kit (Takara). Real-time 
PCR was carried out using SYBR Premix ExTaq (Takara) 
according to the manufacturer's instructions. The relative 
expression of FOXA1 was calculated using the compara- 
tive 2~ AACt method and was normalized to GAPDH. The 
following primer sets were used: for FOXA1, 5'-GAAGAT 
GGAAGGGCATGAAA-3' (forward) and 5'-GCCTGA 
GTTCATGTTGCTGA-3' (reverse); for GAPDH, 5'-CC 
ATGGGGAAGGTGAAGGTC-3' (forward) and 5'-GA 



AGGGGTCATTGATGGCAAC-3' (reverse). To detect 
the miRNA expression levels, Bulge-Loop™miRNA 
qPCR primer sets and U6 primer sets were purchased 
from RiBoBio Corporation (Guangzhou, China). The 
relative expression levels of miRNAs were normalized to 
U6 snRNA. We performed all real-time RT-PCR experi- 
ments in triplicate. 

RESULTS 

ACTLocater: integrating evolutionary conservation and 
chromatin structure to predict TFBSs associated with 
HMR intergenic miRNAs 

As components of conserved regulatory systems (59), we 
reasoned that HMR intergenic miRNAs rely not only on 
the conservation of the miRNA sequences but also on the 
transcriptional regulation elements. Via sequence examin- 
ation and homology searches, we identified 203 HMR 
intergenic miRNAs, which were grouped into 106 tran- 
scriptional units (Supplementary Table SI). We arbitrarily 
chose a 40-kb region upstream of miRNA as the 
TFBS search area, such space was sufficient as indicated 
by the fact that a significant fraction of TSSs of human 
intergenic miRNAs are within 10-kb upstream regions 
(60,61). 

To explore the features of accessible chromatin se- 
quences upstream of HMR intergenic miRNAs, we 
mapped chromatin accessibility information of 12 cell 
types (Supplementary Table S2) obtained from the 
ENCODE Project (12,13) to the search area. In humans, 
mice and rats, 63.7% of the accessible sequences were 
conserved, whereas only 43.0% of the inaccessible se- 
quences were conserved, indicating that accessible 
regions are rich in functional elements. Conserved access- 
ible sequences were found upstream of 101 HMR 
intergenic miRNAs (Figure 1A and Supplementary 
Dataset SI). We next employed a leave-one-out 
cross-validation method to evaluate the conserved access- 
ible chromatin information across cell types. Each round, 
conserved accessible sequences of one cell type were left 
out, and the sequences of left-out cell type recovered from 
those of the remaining cell types were then examined. The 
percentages of sequences recovered ranged from 74.7% to 
90.0% (Figure IB). On average, for each examined cell 
type, 79.7% of the conserved accessible sequences could 
be recovered from the remaining cell types. Such high 
values were not due to the similarities across cell types 
(Supplementary Table S3). SK-N-SH_RA and SKMC, 
which did not have any similar cell types in our collection, 
still had high sequence recovery values (80.2% and 78.5%, 
respectively). These results indicated that a significant 
fraction of conserved accessible sequences integrated 
from diverse sources were preserved across cell types and 
also implied that such integrated sequences may provide 
valuable clues for other cells whose chromatin accessibility 
has not been investigated. 

Based on these observations, we designed the 
ACTLocater algorithm to incorporate chromatin accessi- 
bility and evolutionary conservation profiles, as shown in 
Figure 1C. First, we constructed a reference for accessible 
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Figure 1. Development of the ACTLocater method. (A) Summary of conserved accessible sequences upstream of 101 HMR intergenic miRNA units. 
The 101 regions upstream of miRNA units were arranged side by side. Within each region, conserved accessible sequences of 12 cell types were 
aligned according to their genomic coordinates. (B) Results of the leave-one-out cross-validation performed in 12 cell types. The horizontal axis 
shows the cell type left out. The vertical axis shows the percentage of conserved accessible sequences recovered by the remaining cell types. The 
average is marked by the dashed line. (C) Flowchart for the ACTLocater method. PSSMs — position specific scoring matrices. (D) The screenshot 
shows genome browser tracks for E-boxes (in human genomic sequence), conserved E-boxes (found in human, mouse and rat), verified E-boxes 
[two E-boxes verified as negative result and positive result in Ma et al. (24), respectively], accessible regions (accessible chromatin regions merged 
from 12 cell types) and placental mammal basewise conservation profile (35) for the 4-kb flanking region upstream of Hsa-miR-lOb. 



chromatin in the human genome by merging the DNase- 
seq and FAIRE-seq peak regions of 12 cell types. Then, 
we extracted multiple alignments within the accessible ref- 
erence regions (referred to as HMR accessible alignments) 
from the Human/Mouse/Rat genome alignments. Finally, 
we identified TFBSs in HMR accessible alignments by 
CScanner and/or MScanner (details in 'Materials and 
Methods' section). 

Initially, the search area contained 9750 conserved 
sequence blocks (~ 1.45 Mb human DNA sequences). 
When restricted to the accessible chromatin reference 
regions, only 3217 blocks (~0.34Mb human sequences) 
remained. As the search area was reduced by 76.6%, the 
prediction specificity was greatly improved. To illustrate, 
we made a simple prediction of the Twists hsa-miR- 10b 
interaction similarly to that in Ma et al. (24) and 
visualized it in the UCSC genome browser (Figure ID). 
Within the 4-kb region upstream of hsa-miR- 10b, we 
identified 12 E-boxes (CANNTG), of which 5 were 
conserved. By comparing the predictions against the 
accessible chromatin reference, only one E-box 



remained, which was the proven Twist-binding site (24), 
demonstrating the high efficiency of ACTLocater. 

Prediction and experimental validation of 
FOXAl-targeted miRNAs 

To assess the performance of its predictions, we applied 
ACTLocater to predict HMR intergenic miRNAs 
regulated by FOXA1. A total of 52 binding sites were 
predicted (Supplementary Table S4). In contrast, the 
Conservation method (applying ACTLocater without 
considering chromatin accessibility) predicted 223 sites, 
indicating that incorporating chromatin accessibility in- 
formation can greatly reduce the number of predictions. 

To evaluate the significance of the predicted FOXA1- 
binding sites, we repeated predictions with shuffled in- 
put alignments. As shown in Figure 2A, the predictions 
of ACTLocater had significantly higher signal-to-noise 
ratios than those of the Conservation method 
(Two-sided two-sample Student's f-test, P = 2.2 x 10~ 19 ). 
We then used FOXA1 -binding peaks from previous 
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Figure 2. Prediction and validation of FOXA1 -targeted HMR intergenic miRNAs. (A) A boxplot shows the signal-to-noise ratios of FOXA1 
predictions by the Conservation method and ACTLocater method, respectively. ***/>< 0.001. (B) Examples of ChlP assays in MCF-7 cells. 
Primers against predicted regions were used to amplify DNA immunoprecipitated with FOXA1 antibody (lane 3) or control rabbit IgG (lane 2). 
Genomic DNA was used as a positive control (lane 1). Site P2 validated by previous studies (54-56) and site NC with no evidence of FOXA1 binding 
(54) were used as a positive and a negative control for ChlP enrichment, respectively. (C) Real-time RT-PCR and (D) western blotting analysis 
of FOXA1 from MDA-MB231 cells stably expressing empty vector or FOXA1. Error bars indicate s.e.m.; */ > <0.05. (E) Venn diagram of 
FOXA1 -targeted HMR intergenic miRNA units identified by ACTLocater or previous genome-wide ChlP studies. miRNA units found only in 
the genome-wide ChlP studies were classified according to the chromatin accessibility and conservation of corresponding ChlP peak regions. (F) Fold 
changes of randomly selected and negative control miRNAs in FOXA1 overexpressing MDA-MB231 cells relative to the empty vector controls. 
miRNAs with multiple locus in the human genome are marked underlines. Hsa-miR-147a and hsa-miR-150-3p, which there were no FOXAl-binding 
sites associated with, were used as negative controls. Error bars indicate s.e.m.; two-sided one-sample Student's r-test; */ > <0.05; **/ > <0.01; 
***P< 0.001. (G) Classification of predicted FOXAl-binding sites. Sites overlapped with ChlP peaks of H3K4mel, H3K4me3 or Pol II are 
marked as solid circles, otherwise are marked as unfilled circles. The plot shows the distances (absolute value) between sites and their corresponding 
TSSs. The TSS of the first miRNA unit is not available. **P<0.01. 
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genome-wide ChIP studies (54-56) to evaluate the quality 
of these predictions. Of the binding sites predicted by the 
Conservation method, 13.9% (31/223) could be validated 
against the known set of binding sites. For ACTLocater, 
the number of predicted sites reduced to 52, 15 of which 
were known binding sites and the PPV increased to 
28.8%. The sensitivities of the Conservation method and 
ACTLocater were 15.6% and 8.4%, respectively. 
ACTLocater demonstrated substantially higher specificity 
than the Conservation method, while reducing the sensi- 
tivity. We next conducted ChIP assays in MCF-7 cells to 
verify predictions, which were not validated by previous 
genome-wide ChIP studies. Representative examples of 
ChIP results are shown in Figure 2B, and full results can 
be found in Supplementary Figure SI. 81.1% (30/37) of 
these sites were confirmed in MCF-7 cells, and thus the 
overall PPV of ACTLocater predictions was 86.5% (45/ 
52), emphasizing its specificity in predicting FOXA1- 
binding sites. Furthermore, these results also indicated 
that ACTLocater could discover novel sites which were 
missed by high-throughput experimental methods. 

To examine the functional importance of the pre- 
dicted sites, we employed miRNA arrays to monitor the 
miRNA expression changes in FOXA1 overexpressed 
MDA-MB23 1 cells (Supplementary Table S6). FOXA1 
overexpression was validated by real-time RT-PCR 
(Figure 2C; two-sided two-sample Student's /-test, 
P = 0.0048) and western blotting (Figure 2D). We found 
that 50.0% (20/40) of the HMR intergenic miRNA units 
predicted by ACTLocater and 46.3% (31/67) of those 
found by previous genome-wide ChIP studies (54-56) 
showed a change in expression. 90.0% (18/20) of the 
FOXA1 -affected HMR intergenic miRNA units predicted 
by ACTLocater were also identified by the genome-wide 
ChIP studies, indicating that the miRNA units detected by 
ACTLocater correlate well with those found by previous 
genome-wide ChIP studies. However, 13 miRNA units 
detected by the genome-wide ChIP studies were not pre- 
dicted by ACTLocater. A detailed examination of the 
genomic features of the corresponding ChIP peak 
regions showed that these sequences lacked a conserved 
FOXAl-binding motif (Figure 2E), which is necessary for 
ACTLocater prediction. Finally, we used a more sensitive 
method, real-time RT-PCR, to further investigate the 
effect of ectopic expression of FOXA1 on miRNAs. We 
randomly selected 22 miRNA units and chose one 
member miRNA for testing from each unit. 77.3% 
(17/22) of the randomly selected miRNAs showed sig- 
nificant changes in expression levels, while two negative- 
control miRNAs showed no significant changes 
(Figure 2F). Hsa-miR-130a, of which the predicted site 
was not experimental supported, was likely regulated by 
FOXA1 indirectly. It should be noted that some of the 
examined miRNAs have multiple copies in the human 
genome, and the detection of real-time RT-PCR cannot 
ensure that the expression changes of multi-locus miRNAs 
were from the locus with predicted FOXA1 -binding sites. 
When considering the miRNAs associated with experi- 
mental supported FOXAl-binding sites and derived 
from a unique locus, there are still 73.3% (11/15) 
miRNAs showed significant changes. These results 



indicated that most of the predicted FOXAl-binding 
sites were functional. Hsa-miR-202, hsa-miR-129-3p and 
hsa-miR-24, members of three miRNA units uniquely pre- 
dicted by ACTLocater, all showed a significant change in 
expression levels, suggesting that ACTLocater could 
predict novel functional FOXAl-binding sites missed by 
genome-wide experimental assays. 

Distinct histone signatures have been found for 
proximal promoters and distal enhancers. Previous 
studies have shown that strong H3K4mel and 
H3K4me3 signals were associated with promoters, 
whereas strong H3K4mel and weak H3K4me3 signals 
were associated with enhancers (10,62). To examine 
whether ACTLocater could predict not only proximal 
but also distal FOXAl-binding sites, we collected 
ChlP-seq peaks of histone modifications from various 
cells released by the ENCODE Project (12,13) (datasets 
are listed in Supplementary Table S2) and used them to 
classify the FOXAl-binding sites. Of the 51 predicted 
binding sites with histone modifications available, 37 
were classified as proximal TFBSs and 14 were classified 
as distal TFBSs (Figure 2G). These classifications were 
supported by the fact that proximal TFBSs were usually 
associated with the Pol II ChlP-seq peaks, but not the case 
for distal TFBSs (Two-sided Fisher's exact test, 
P = 0.00044; Pol II ChlP-seq datasets were obtained 
from the ENCODE Project, as listed in Supplementary 
Table S2). Moreover, the distances of proximal TFBSs 
from the corresponding TSSs were significantly less than 
distances between distal TFBSs and their TSSs (Two-sided 
two-sample Student's /-test, P = 0.0028). These data 
indicated that ACTLocater predicted both proximal and 
distal FOXAl-binding sites. Six sites were found to be 
located in genomic regions between TSSs and miRNAs, 
suggesting that these regions could contain additional 
functional regulatory elements. 

Evaluation on other selected TFs 

To further assess its performance, we applied ACTLocater 
to identify the target miRNAs of c-Myc, the E2F family 
and NRSF (also known as REST). The program predicted 
29 c-Myc, 25 E2F and eight NRSF-binding sites 
(Supplementary Table S7, S8 and S9), corresponding to 
22 c-Myc^miRNA, 18 E2F—> miRNA and seven 
NRSF^miRNA interactions, respectively. The PPVs of 
these predictions based on ACTLocater were all superior 
to that of the Conservation method, while the sensitivities 
of these two methods were comparable (Table 1). In 
previous studies (23,25), 10 HMR intergenic miRNA 
units have been identified as directly transcriptional 
targets of c-Myc in P493-6 cells. ACTLocater successfully 
predicted 50% (5/10) of these miRNA units. As the c-Myc 
binding was assayed in conserved regions without con- 
sidering the presence of a binding motif in Chang et al. 
(25), we examined the binding regions (PCR regions in the 
previous study spanned 1 kb) of the missing units and 
found that c-Myc binds to these regions via non-canonical 
motifs, causing them to be overlooked by ACTLocater. 
Additionally, some c-Myc predictions not validated by 
ChlP-seq were supported by other studies. For example, 
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Table 1. Summary of FOXA1, c-Myc, the E2F family and NRSF predictions by the conservation method and ACTLocater 



TF 




Conservation 11 






ACTLocater 




No. sites b 


PPV C (%) 


Sensitivity 11 (%) 


No. sites b 


PPV C (%) 


Sensitivity 11 (%) 


FOXA1 


223 


13.9 


15.6 


52 


28.8 


S.4 


c-Myc 


41 


51.2 


5.8 


29 


72.4 


5.8 


E2F 


38 


47.4 


10.2 


25 


64.0 


9.4 


NRSF 


17 


52.9 


33.3 


8 


87.5 


25.9 



u The Conservation method was the same as ACTLocater but without considering chromatin accessibility reference. 
b Number of sites predicted by the Conservation method and ACTLocater, respectively. 

C PPV and d Sensitivity of FOXA1 predictions were assessed by the genome-wide ChlP studies (54-56) and that of remaining TFs were assessed by the 
corresponding ChlP-seq datasets obtained from the ENCODE Project, as listed in Supplementary Table S2. 



H19 RNA, the precursor of hsa-miR-675 and hsa-miR-9-3 
have been found to be direct targets of c-Myc (63,64). 
These results showed that ACTLocater can yield consist- 
ent high-specificity predictions. 

Assessment of the applicability of ACTLocater 

To determine the scope of our method, we used 
ACTLocater to predict cell-specific MyoD^miRNA 
and Oct4^miRNA interactions that occur in skeletal 
muscle and embryonic stem cells, respectively. 
Meanwhile, we generated two predictions with the chro- 
matin accessibility data of skeletal muscle and embryonic 
stem cells and used them as independent verifications of 
the ACTLocater predictions. 

To test whether MyoD^-miRNA interactions could be 
predicted by the chromatin accessibility data of 
non-muscle cell types, we removed the chromatin accessi- 
bility data of SKMC cell from our chromatin accessibility 
reference. A total of 22 MyoD-binding sites were pre- 
dicted by ACTLocater (Figure 3A and Supplementary 
Table S10). Also, we made a similar prediction with the 
chromatin accessibility data of skeletal muscle cells and 
yielded 26 MyoD-binding sites (Figure 3A and 
Supplementary Table Sll). To avoid the potential differ- 
ences in FP predictions between two datasets, we only 
considered sites with evidence supported. As shown 
in Figure 3B, 71.4% (10/14) of evidence supported sites 
(50) made with the chromatin accessibility data of skeletal 
muscle cells could be successfully predicted with those of 
non-muscle cells. 

We next performed a similar analysis between the two 
Oct4 predictions, which were made with chromatin acces- 
sibility data of non-stem cells (our chromatin accessibility 
reference) and embryonic stem cells, respectively. Both 
predictions reported 16 Oct4-binding sites (Figure 3C; 
Supplementary Table S12 and SI 3). As shown in Figure 
3D, 77.8% (7/9) of evidence supported sites (26,65) made 
with the chromatin accessibility data of stem cells could be 
successfully predicted with those of non-stem cells. 

Taken together, most cell specific TF^-miRNA inter- 
actions were successfully predicted, indicating that 
ACTLocater could be applied to a wide range of cell 
types, such as cells whose chromatin accessibility profiles 
had not yet been characterized. 



A 




No evidence supported site Evidence supported site 
B D 




Figure 3. Applicability assessment of ACTLocater. (A) Summary of 
MyoD-binding sites predicted with chromatin accessibility data of 
skeletal muscle and non-muscle cells, respectively. Sites are aligned ac- 
cording to the genomic coordinates. (B) Venn diagram of evidence 
supported MyoD sites. (C) Summary of Oct4-binding sites predicted 
with chromatin accessibility data of embryonic stem and non-stem cells, 
respectively. Sites are aligned according to the genomic coordinates. (D) 
Venn diagram of evidence supported Oct4 sites. 



ACTViewer: regulatory map of HMR intergenic miRNAs 

Given that ACTLocater achieved high PPVs and could 
be widely applied, we constructed a regulatory map com- 
posed of TFBSs associated with HMR intergenic miRNAs 
by applying ACTLocater to the PSSMs from 
TRANSFAC (45), JASPAR (46) and UniPROBE (66). 
We found that different names in these database entries 
sometimes corresponded to TF isoforms or even to the 
same TF, which poses a serious challenge to consistent 
TF nomenclature. For the present study, we separated 
the predictions with the names and accession numbers 
retained. We retrieved a total of 295 PSSMs from 
TRANSFAC and predicted 13 688 TFBSs, corresponding 
to 4085 TF^miRNA interactions. Additionally, we 
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Figure 4. ACTViewer: a database documented the ACTLocater predictions on available PSSMs. (A) A snapshot of the ACTViewer genome 
browser. The controls (at the top of the browser) position the browser over a specific region in the genome. Annotations are displayed as individual 
tracks along the genomic regions. (B) A pop-up window containing detailed information about a predicted HNF4A-binding site. (C, D and E) 
Snapshots of searching ACTViewer for target miRNAs of NF-kB (C), TFBSs associated with hsa-miR-101-1 (D) and c-Myc^hsa-miR-17 inter- 
action (E). Search results for target miRNAs of NF-kB and TFBSs associated with hsa-miR-101-1 are partially shown. 



retrieved 129 PSSMs from JASPAR and predicted 13 170 
TFBSs, corresponding to 3320 TF^miRNA interactions. 
Finally, we retrieved 389 PSSMs from UniPROBE and 
predicted 938 TFBSs, corresponding to 724 
TF^miRNA interactions. Although false predictions 
could not be avoided, a large number of predictions 
were assumed to be true TFBSs according to the high 
precision of ACTLocater, indicating that a complex regu- 
latory network governing the expression of HMR 
intergenic miRNAs remains to be decoded. 

To provide an effective view of the TFBSs predicted by 
ACTLocater, we developed the ACTViewer database 
(ACTViewer is accessible at http://deepbase.sysu.edu.cn/ 
ACTViewer/). We provided a genome browser and search 
forms in ACTViewer to facilitate data access. Predictions 
from ACTLocater could be viewed using the ACTViewer 
browser (Figure 4A). The browser displayed annotation 
tracks beneath genome coordinate positions, allowing 
rapid visual correlation of different types of genome an- 
notations. Zooming and scrolling controls helped to 
narrow or broaden the displayed chromosomal range to 
focus on the exact region of interest. We displayed 
ACTLocater predictions as tracks according to the 
PSSMs data sources and added secondary links from 
entries within prediction tracks to lead to corresponding 
details from each entry (Figure 4B). With the ACTViewer 
browser, users can conveniently examine TFBSs that are 
clustered, overlapping or located in special genomic 
regions. To access the predictions made about a specific 



TF or miRNA, we also provided a list of search forms 
according to TF and/or miRNA. A query using a specific 
TF could retrieve the miRNAs predicted as its target 
genes, and using a specific miRNA could allow users to 
find the predicted TFBSs associated with the miRNA 
input (Figure 4C and D, respectively). Additionally, 
users could retrieve interactions from ACTViewer by spe- 
cifying a TF and a miRNA simultaneously (Figure 4E). 
In conclusion, ACTViewer provided interfaces for 
integrating visualization and rapid retrieval of the 
ACTLocater predictions. 

DISCUSSION 

This work has presented a new computational method 
called ACTLocater that integrates evolutionary conserva- 
tion and chromatin structure to predict TF^miRNA 
interactions with significantly improved PPVs. Most 
notably, ACTLocater uses chromatin accessibility data 
from a limited number of cell types for reference, but 
could also be applied to other cell types whose chromatin 
accessibility has not yet been characterized. Thus, the 
ACTLocater method and the ACTViewer resource have 
considerable potential to advance studies of the miRNA 
transcriptional regulation that underlies diverse human 
biological processes. 

Previously, several studies have also attempted to 
predict the TFBSs associated with human miRNAs. For 
example, miRGen (33) has mapped all vertebrate TF 
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matrices from TRANSFAC to the regions spanning 5-kb 
upstream to 1-kb downstream of the miRNA TSSs. 
MIR@NT@N (34) has predicted potential TFBSs in the 
promoter regions of human miRNAs based on PSSMs 
from JASPAR and the 'CpG island' signal. However, 
these methods are based on simple pattern matching and 
are thus prone to false predictions. Until this point, the 
comparative genomics approach, such as the TFBS 
conserved track in the UCSC genome browser (35), was 
the most efficient method to predict potential TFBSs 
associated with miRNAs. As shown by our results for 
several TFs, ACTLocater yields a significant improvement 
in PPV compared to the comparative genomics approach 
using the same parameters (improvement in PPV could 
also be observed by comparing selected TF predictions 
to the UCSC conserved TFBS track, and the benefit of 
incorporating chromatin structure features could be 
noticed by comparing these predictions of the UCSC 
conserved TFBS track before and after filtering by our 
chromatin accessibility reference; see Supplementary 
Figure S2). Owing to the accumulation of cell-specific ex- 
perimental data, computational methods integrating this 
information have also been developed to identify human 
TFBSs associated with miRNAs or on a genome-wide 
scale. For example, a method based on sequence 
features, histone modifications, and DNase 1 hypersensi- 
tivity has been developed and applied to human CD4 + T 
cells (15). MITF-regulated miRNAs have also been suc- 
cessfully identified by combining nucleosome information 
and motif matching in human melanoma cells (38). 
Furthermore, CENTIPEDE (16) has been used in 
human lymphoblastoid cells by incorporating DNase I 
hypersensitivity. In contrast to these cell-specific 
methods, ACTLocater was based on a chromatin accessi- 
bility reference, which was derived from a panel of cell 
lines and was highly preserved across cell types, making 
ACTLocater widely applicable. 

ChlP-seq has been considered the state-of-the-art 
experimental technique for profiling TFBSs including 
species-specific and non-canonical binding sites. 
Additionally, ChlP-seq has been able to avoid ambiguity 
for TFs that share similar binding preferences; however, 
ChlP-seq has to be carried out for only one TF using one 
set of conditions at a time. Moreover, because of the 
sequencing depth and performance of data analysis, 
some true binding sites are likely to be missed (67,68). 
ACTLocater, by contrast, can evaluate all TFs simultan- 
eously, to provide nucleotide precision and to uncover 
novel binding sites missed by ChlP-seq assays. Because 
of these abilities, ACTLocater and ChlP-seq technology 
could be complementary tools to provide exhaustive infor- 
mation regarding c/'.v-regulatory networks. 

Although the performance of our method is 
encouraging, integrating more source data is likely to 
improve the ability of ACTLocater. Currently, chromatin 
accessibility datasets from only 12 cell types were used; we 
expect that by incorporating more chromatin accessibility 
datasets from other cell types released by the ENCODE 
Project (12,13), the coverage of all possible accessible and 
conserved regions will increase (improvement in the 
coverage of these regions by increasing cellular datasets 



can be seen in Supplementary Figure S3), and thereby 
improve the sensitivity of ACTLocater. Another limita- 
tion of ACTLocater is its reliance on binding motifs for 
the recognition of TFBSs. The human genome encodes 
about 1400 TFs with sequence-specific DNA-binding 
properties (69), and binding preferences were only avail- 
able for a small proportion of these factors. Establishing 
the binding specificities of these TFs using techniques such 
as protein-binding arrays and high-throughput SELEX 

(70) will undoubtedly expand the predictive ability of 
ACTLocater. 

Currently, ACTLocater focuses on the transcriptional 
regulation of intergenic miRNAs in humans, but modified 
algorithms based on the same principles could be applied 
for other genes or on a genome-wide scale as long 
as the required data are provided. Furthermore, the 
multiple genome alignments of Drosophila and 
Caenorhabditis species were already available (35), and 
the monENCODE Project continues to produce a large 
number of chromatin accessibility data of D. melanogaster 

(71) and C. elegans (72). Integrating these datasets, 
ACTLocater could also be expanded to these species. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-13, Supplementary Figures 1-3, 
Supplementary Methods, Supplementary Dataset 1, 
Supplementary Software 1 and Supplementary Reference 
[73]. 
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